A few words on what makes good code

For any given problem there are many different solutions and paths. While some paths may be more efficient or shorter than others, there are two necessary conditions for good code. Good code must be:

  1. Functional
  2. Easy to understand

One cannot come at the expense of the other. These principles are behind every decision I make and form the backbone of every script I write, in any language. I am confident you will see this reflected in my work.

Styling

One of the most important ways to create good code is to follow good coding practices from the beginning, as going back to fix things will always be costlier than starting out the right way. Throughout this report you will see blue text boxes that explain some of the styling decisions made in the script to keep it functional and readable.

This is an example of a style textbox.

Formal writing and academic writing disclaimer: the first-person plural and the active voice are used to engage with the reader.

Reproducibility & Portability

Making sure your code is reproducible and portable is also essential for good code. I always create a new R project for each assignment, maintain R environments through renv, and keep detailed records of every change through version control (as you can probably tell by reading this document on GitHub). In fact, this project not only uses renv to increase its reproducibility and portability, it also contains a mamba directory with the .Rprofile and config.yml files needed to guarantee that, no matter when or on what system it is run, this project is 100% reproducible and portable. Just remember not to use mamba and renv together, as they can conflict with each other.

Feedback

Feedback here.

Problem statement

This is a question I came across while completing a questionnaire for the World Bank’s Development Impact Evaluation (DIME) department. While not a verbatim quote, the question was:

There is a program that is implemented at the village level. Households within the same village are very similar, but households in different villages are not. To maximize the likelihood of detecting the program’s effect, is it better to sample more households within each village or to sample more villages?

Intuitively, one may think that it is better to sample more villages. If households within each village are similar, then an additional household from a village that has already been sampled contributes less to the regression’s power than a household from an unsampled village, which is therefore different from all the other households in the sample.
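This intuition can be made concrete with the standard design-effect approximation for cluster samples. The sketch below is not part of the original question. It assumes J sampled villages, m households in each village (equal cluster sizes), and an intraclass correlation ρ > 0 that measures how similar households within a village are. All three symbols are introduced here for illustration.

```latex
n_{\text{eff}} = \frac{J\,m}{1 + (m - 1)\,\rho},
\qquad
\lim_{m \to \infty} n_{\text{eff}} = \frac{J}{\rho}
```

Under these assumptions the effective sample size grows in proportion to the number of villages J, but adding households within a village runs into a ceiling of J/ρ. The more similar the households within a village (the larger ρ), the lower that ceiling.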

Visualizing the result of different sampling strategies

As mentioned above, intuitively one might expect that sampling households from different villages would increase the statistical significance of the estimator. Let’s take a look at the first graph. It’s worth saying that all these graphs are interactive, so you may pan, rotate, zoom, etc., as well as hover over the plot to see the number of villages in each treatment group (i.e. treated and control), the total sample size, and the p-value, with a ’*’ at each of the usual significance thresholds (10%, 5%, 1%).

From this graph it is not immediately obvious that either sampling strategy is better than the other. In fact, it seems as if the surface descends at the same rate regardless of whether you increase the number of households per village or the number of villages per treatment group. Additionally, the surface of this graph is very rough, with many local maxima and minima scattered throughout. Some roughness is of course expected as a result of idiosyncratic errors. It is nevertheless surprising, as this graph shows the average p-value over 1,000 runs.
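To make the idea of an "average p-value over many runs" concrete, here is a minimal Python sketch of one way a single point on such a surface could be computed. It is not the code behind the graphs in this report. The function names, the parameters `village_sd` and `household_sd`, and the test itself (a normal-approximation z-test on village-level means) are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_p_value(villages_per_arm, households_per_village, effect,
                     village_sd=1.0, household_sd=0.2, rng=None):
    """Simulate one study and return a two-sided p-value for the effect.

    Assumed setup (hypothetical): treatment is assigned at the village
    level, each village has its own shock, and households add small
    idiosyncratic noise. Requires villages_per_arm >= 2.
    """
    rng = rng or random.Random()
    arm_means = {}
    for treated in (0, 1):
        village_means = []
        for _ in range(villages_per_arm):
            village_shock = rng.gauss(0.0, village_sd)
            outcomes = [
                effect * treated + village_shock + rng.gauss(0.0, household_sd)
                for _ in range(households_per_village)
            ]
            village_means.append(mean(outcomes))
        arm_means[treated] = village_means

    # Compare arms using village means, so the village is the unit of analysis.
    diff = mean(arm_means[1]) - mean(arm_means[0])
    se = ((variance(arm_means[1]) + variance(arm_means[0])) / villages_per_arm) ** 0.5
    z = diff / se
    # Normal approximation; a t-test or clustered regression would differ slightly.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def average_p_value(villages_per_arm, households_per_village, effect,
                    runs=1000, seed=0):
    """Average the simulated p-value over `runs` independent studies."""
    rng = random.Random(seed)
    return mean(
        simulate_p_value(villages_per_arm, households_per_village, effect, rng=rng)
        for _ in range(runs)
    )
```

Evaluating `average_p_value` over a grid of village and household counts would give one surface of the kind plotted here, under the assumptions above. Fixing `seed` makes each point reproducible, which helps when comparing the roughness of surfaces built from different numbers of runs.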

Let’s now see how this graph changes as we increase the effect size. Once again I encourage you to explore each graph.

Getting the results